-
Notifications
You must be signed in to change notification settings - Fork 688
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: hostname port issue #4816
fix: hostname port issue #4816
Conversation
When unit testing diagd most of the tests us the `econf_compile` helper which by default will assert that no aconf errors existed. This is good but it requires puttng a debugger on it to grab the errors. This prints out the object so that when the test is run outside of an IDE with a debugger the errors can be seen. Altough its not neatly diffed/printed due to being a nested structure. There is an issue to improve this but looks to be still open: pytest-dev/pytest#1531. Signed-off-by: Lance Austin <laustin@datawire.io>
79c3b19
to
692b558
Compare
When trying to expose a Host with a hostname that includes a port such as 'example.com:8500', causes 404 NR's due to the way envoy was configured. In the v1.14, we used to group everything under a wildcard virtual_host domain "*" and did not include sni matching on the Filter Chain. This allowed that configuration to work but had the downside of causing large memory usage and slower route matching due to lumping all routes into a single virtual host. If the v2.Y and v3.Y, series this was addressed by creating seperate Filter Chains for each host. A non-tls Host would get a single Filter Chain with multiple virtual_hosts per Host. A Host with TLS would produce a 1-1 FilterChain and Virtualhost. This works fine in most cases when downstream clients are connecting on standard ports (80/443) but when a client needs to connect on something like example.com:8500 this would effectively generate a Filter Chain that could never match on an incoming request. In the non-tls scenario there are no changes to what gets generated. In the TLS scenario we now parse the hostname into two entites, sni and virtual_host.domain. So, example.com:8500 would have an sni of example.com and a virtual_host.domain of example.com:8500. We create a Filter Chain for the sni value and add a virtual host for the domain value. If a second Host with just `example.com` is provided as well then we will attempt to merge these into a single Filter Chain with multiple virtual_host. We can only do this when the same TLSContext is used because we wouldn't know which attributes to take from which host since all the transport_socket settings are at the Filter Chain level. This restores existing behavior in a backwards compatiabile way and doesn't try to solve the Developer Experience (DX) with the way the Host is currently designed. Signed-off-by: Lance Austin <laustin@datawire.io>
692b558
to
4847a15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question about the changelog but otherwise no concerns
@@ -391,17 +394,17 @@ def save_context(self, ir: "IR", ctx_name: str, tls_ss: SavedSecret, tls_name: s | |||
# TLS config is good, let's make sure the hosts line up too. | |||
context_hosts = ctx.get("hosts") | |||
|
|||
# XXX WTF? self.name is not OK as a hostname! Leaving this for the moment, but it's | |||
# almost certainly getting shredded before 2.0 GAs. | |||
host_hosts = [self.name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the ramifications of getting rid of name? Basically why can we do it now and it wasn't done before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as we discussed offline, I dug into this deeper and although we don't document this anywhere there is a potential that a user could ignore our docs and intpret the TLSContext.hosts
as needing to match the name of a Host.name
.
Although, it is unlikely and it is not our intended behavior,I will add it back since it is not crucial to the fix for this PR. I will capture it along with the other deprecated TLS features that I outlined and we can plan for its removal in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the overall issue on this topic:
#4826
I make a mention to this and that it is deprecated. I will have a fixup commit here shortly that does remove some dead code where we check self.hostname or self.name
because we default self.hostname to "*" if user doesn't provide it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FTR I'm the one who wrote that comment. self.name
is not guaranteed to be something that's actually sane for a DNS match, IIRC, so I'm in favor of it going away. However, it used to be important that host_hosts
never be empty; might be worth a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Newer folks, if you see a comment that starts with XXX
, it's almost certainly something I wrote. 🙂)
other = next(iter(self.hosts.values())) | ||
other_type = "TCPMapping" if isinstance(other, IRTCPMappingGroup) else "Host" | ||
tcpmapping.post_error( | ||
"TCPMapping {tcpmapping.name}: discarding because it conflicts with {other_type} {other.name}" | ||
f"TCPMapping {tcpmapping.name}: discarding because it conflicts with {other_type} {other.name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we're not going to allow a Host resource with hostname example.com:5000
and a TCPMapping with host example.com
regardless of port specified on the TCPMapping?
What if you specified a different port on the TCPMapping? Theoretically you should be able to put them on different filter chains with different filter chain matching based on SNI & port.
Or this that case handled elsewhere since believe a new listener is created if they aren't any with the specified TCPMapping port
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we're not going to allow a Host resource with hostname example.com:5000 and a TCPMapping with host example.com regardless of port specified on the TCPMapping?
Correct, in v3listeners.finalize
we process TCPMappings first and their filter chains get precedence on a Listener. Luke left a comment in that function that outlines it well, the reasoning for this has to do with providing sensible upgrade paths.
Since TCPMapping gets precedence in the scenario you outline they are both trying to generate a FilterChain with the same FilterChainMatch (TLS & SNI (example.com
) which Envoy doesn't allow. Technically, you could differentiate these filter chains by adding alpn application_protocol: [h2,http/1.1]
and Envoy would accept it but Envoy itself calls out the fact that ALPN is not reliable so I think this would cause more headaches then it benefits.
What if you specified a different port on the TCPMapping? Theoretically you should be able to put them on different filter chains with different filter chain matching based on SNI & port.
Or this that case handled elsewhere since believe a new listener is created if they aren't any with the specified TCPMapping port
Correct, that would work because the TCPMapping would attach to an existing Listener or implicitly create one behind the scenes for that address/port
thus there would be no overlap.
Couple of things
-
I think this function where you commented on, can be confusing because it relies on having knowledge of the TCPMappings taking precedence in
V3Listeners.finalize
and also the method chaining where theV3Listener.add_chain
is shared by TCP/HTTP chains and then returns the V3Chain. So, mentally you have switch context pending which code path led you here. -
While I was looking into this I did find a bug when trying to apply a non-tls cleartext TCPMapping to a listener that serves HTTP. I have captured that here: tcpmappings: invalid envoy configuration generated when applying non-tls TCPMapping on listener with HTTP #4825
@@ -62,7 +64,8 @@ def __init__(self, config: "V3Config", context: Optional["IRTLSContext"]) -> Non | |||
|
|||
self.context = context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is honestly just a nit but I keep getting confused what context stands for and have to constantly look up to see it refers to a TLSContext 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree...but its fairly pervasive and would be out of scope for this PR :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a. @haq204, you're not wrong. 😐
b. A bunch of this is stuff I wrote before TLSContext
existed as a named thing.
c. Sorry. 😇
self.add_chain("tcp", context, group_host).add_tcphost(irgroup) | ||
|
||
sni = group_host.rsplit(":", 1)[0] if group_host else "*" | ||
self.add_chain("tcp", context, group_host, group_host).add_tcphost(irgroup) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you meant to pass in the sni variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup good catch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually going to push a fixup commit for this. I had originally done this for symmetry between HTTP and TCP but the TCPMapping already has host
and port
. The port is required but the host is optional if the user wants to do SNI and match against TLSContext. Therefore, for this PR I think it is out-of-scope and we can leave it. But, maybe it would be good to provide a earlier TCPMapping validation on the host to ensure the user hasn't done it. That to me though is a new feature and could be done in a follow-up PR.
Signed-off-by: Lance Austin <laustin@datawire.io>
Signed-off-by: Lance Austin <laustin@datawire.io>
Description
When trying to expose a
Host
with ahostname
that includes a non-standard port such as 'example.com:8500' over a TLS connection, a client would receive a 404 NR's due to Envoy not being able to find a valid route.For example:
In v1.14, emissary would group everything under a single FilterChain with a wildcard virtual_host ("*") and then do :authority header matching using header matches on the Route. This allowed the above configuration to work but had the downside of causing large memory usage and slower route matching due to lumping all routes into a single virtual host.
In v2.Y/ v3.Y, this was addressed by creating separate Filter Chains for each
Host
. A non-tls Host would get a single FilterChain with multiple virtual_hosts perHost
. A Host with TLS will produce a 1-1 FilterChain and virtual_host. This works fine in most cases when downstream clients are connecting over standard ports (80/443) because the dns portion of the:authority
header matched thevirtual_host.domain
. But, when a client needs to connect on something likethe example above this would effectively generate a Filter Chain that could never match on an incoming request due to the overloaded nature of how we use
Host.hostname
.Before Fix (psuedo envoy config):
After Fix
In the non-tls scenario there are no changes to what gets generated since we behave similar to v1.y in that we don't have sni to differentiate FilterChains so virtual_hosts per
Host
are generated along with route header matching.In the TLS scenario we now parse the hostname to determine the sni. SNI doesn't not understand ports which is why the configuration above is effectively dead and no connections would ever be made to that Filter Chain.
So,
example.com:8500
would become:We now create a Filter Chain per sni value and add a virtual host for the full hostname which will now produce:
If a second Host with a
hostname=example.com
is provided then we would have a conflict with the 1 FilterChain to 1 virtual Host design that is used in the TLS scenario. Therefore, we will attempt to merge these into a single Filter Chain withmultiple virtual_host. We can only do this merge when the same TLSContex is used because if we didn't then we wouldn't know which TLS settings because these are set at the FilterChain level and not the virtual host level. This also means that using
Host.tls
is not supported in this scenario because internally those are translated to implicit TLSContext and would have the same issue where two Host could be in conflict.In that scenario we would produce the following:
Related Issues
I might be missing some but these capture it mostly.
Testing
CI is green here and pulled into EdgeStack as well as seen here: https://github.com/datawire/apro/pull/3228
I have added some unit tests that are similar to the existing golden test we had a while back. They allowed for a much faster dev-loop but they may suffer from the same challenges we had before where any touch to the default generated config will require them to be updated. I have left them in place for now but we may want to revisit it if it becomes a pain.
Checklist
CHANGELOG.md
.DEVELOPING.md
with any any special dev tricks I had to use to work on this code efficiently.